Every research project aims to answer a research question (or multiple questions).
Do ECU students who exercise regularly have a higher GPA?
Each research question aims to examine a population.
The population for this research question is ECU students.
It is usually impractical, and often impossible, to study the whole population related to a research question.
A sample \(n\) is a subset of the population \(N\).
The Goal: Select a representative sample to generalize to the broader population.
What is representative?
Data quality matters more than data quantity
Many anthropological studies (or similar) are convenience based.
In a simple random sample, every member of a population has an equal chance of being selected.
To Generalize:
Systematic sampling is similar to a simple random sample, BUT members are chosen at regular intervals.
# 1. Create a population (e.g., a vector of 1 to 1000)
population <- 1:1000
# 2. Define the desired sample size
sample_size <- 100
# 3. Calculate the sampling interval (k)
N <- length(population) # Population size
k <- N / sample_size
# If k is not an integer, you might use ceiling(N/n) and adjust the logic
# 4. Choose a random starting point (r) between 1 and k
set.seed(123) # Optional: for reproducible results
start_point <- sample(1:k, 1)
# 5. Select every k-th element starting from the random start point
systematic_sample_indices <- seq(from = start_point, to = N, by = k)
systematic_sample <- population[systematic_sample_indices]
# 6. View the first few elements and the dimension of the sample
head(systematic_sample)
[1] 3 13 23 33 43 53
length(systematic_sample)
[1] 100
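The comment in step 3 notes that the interval may not be an integer. One possible way to handle that case is a fractional-interval approach: draw a real-valued start below the interval and round each position down. This is only a sketch, assuming a hypothetical sample size of 150 (so the interval is about 6.67); the names `n_frac`, `k_frac`, and `frac_indices` are illustrative and do not appear in the example above.

```r
# Sketch: systematic sampling when N / n is not an integer (assumed example values)
population <- 1:1000
N <- length(population)
n_frac <- 150                  # hypothetical sample size; 1000 / 150 is not an integer
k_frac <- N / n_frac           # fractional sampling interval (about 6.67)

set.seed(123)                  # optional: for reproducible results
r <- runif(1, min = 0, max = k_frac)   # random real-valued start below k_frac

# Positions r, r + k, r + 2k, ... rounded down and shifted to 1-based indices
frac_indices <- floor(r + (0:(n_frac - 1)) * k_frac) + 1
frac_indices <- pmin(frac_indices, N)  # guard against floating-point overshoot

systematic_sample_frac <- population[frac_indices]
length(systematic_sample_frac)         # equals n_frac
```

Because consecutive positions are more than one unit apart, no index repeats and the sample has exactly `n_frac` members.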
set.seed(123)
population <- data.frame(
Supermarket = paste("Supermarket", 1:1000, sep = "_"),
CustomerSatisfaction = rnorm(1000, mean = 75, sd = 10)
)
selected_supermarkets <- sample(population$Supermarket, size = 10, replace = FALSE)
sampled_data <- population[population$Supermarket %in% selected_supermarkets, ]
head(sampled_data)
Supermarket CustomerSatisfaction
203 Supermarket_203 72.34855
225 Supermarket_225 71.36343
255 Supermarket_255 90.98509
354 Supermarket_354 76.16637
457 Supermarket_457 86.10277
554 Supermarket_554 77.49825
set.seed(123)
region <- data.frame(
Neighborhood = paste("Neighborhood", 1:500, sep = "_"),
AverageIncome = rnorm(500, mean = 50000, sd = 10000)
)
households <- data.frame(
Neighborhood = rep(sample(region$Neighborhood, size = 500, replace = TRUE), each = 20),
HouseholdID = rep(1:20, times = 500),
EmploymentStatus = sample(c("Employed", "Unemployed"), size = 10000, replace = TRUE)
)
selected_neighborhoods <- sample(region$Neighborhood, size = 5, replace = FALSE)
sampled_households <- households[households$Neighborhood %in% selected_neighborhoods, ]
head(sampled_households)
Neighborhood HouseholdID EmploymentStatus
1981 Neighborhood_302 1 Unemployed
1982 Neighborhood_302 2 Employed
1983 Neighborhood_302 3 Employed
1984 Neighborhood_302 4 Employed
1985 Neighborhood_302 5 Unemployed
1986 Neighborhood_302 6 Unemployed
set.seed(123)
states <- data.frame(
State = paste("State", 1:50, sep = "_"),
Population = sample(1000000:5000000, 50, replace = TRUE)
)
counties <- data.frame(
State = rep(sample(states$State, size = 50, replace = TRUE), each = 20),
County = rep(paste("County", 1:20, sep = "_"), times = 50),
VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)
selected_states <- sample(states$State, size = 3, replace = FALSE)
selected_counties <- sample(counties$County[counties$State %in% selected_states], size = 5, replace = FALSE)
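# Caution: county labels (County_1 ... County_20) repeat in every state, so this
# filter also returns matching counties outside selected_states. One possible
# restriction is to add `& counties$State %in% selected_states` to the condition.
sampled_vaccination_centers <- counties[counties$County %in% selected_counties, ]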
head(sampled_vaccination_centers)
State County VaccinationRate
8 State_32 County_8 70.37428
11 State_32 County_11 66.86024
13 State_32 County_13 70.81309
15 State_32 County_15 67.68222
19 State_32 County_19 70.91839
28 State_46 County_8 68.84869
How do we infer future events or population characteristics?
In a random process there is more than one possible outcome.
The sample space is the set of all possible outcomes of a random process.
An event is a subset of the sample space.
Examples with a 6-sided die:
Let \(A\) represent the event that a single die roll results in an even number.
A = {2, 4, 6}
Let \(B\) represent the event that a single die roll results in an odd number.
B = {1, 3, 5}
Let \(C\) represent the event that a single die roll results in a prime number.
C = {2, 3, 5}
The complement of an event is the set of all outcomes in the sample space that are not in the event itself.
Example:
Let \(C\) represent the event that a single die roll results in a prime number.
C = {2, 3, 5}, so \(C^C\) = {1, 4, 6}
Events \(A\) and \(B\) are mutually exclusive because an outcome cannot be both even and odd.
Events \(A\) and \(C\) are not mutually exclusive because the outcome 2 is both even and prime.
| Description | Notation | Reading | Elements |
|---|---|---|---|
| Union | \(A \cup C\) | A or C | {2, 3, 4, 5, 6} |
| Intersection | \(A \cap C\) | A and C | {2} |
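The events above can also be checked with base R's set functions. This is a small sketch; the variable `S` for the sample space is a name introduced here for illustration.

```r
S <- 1:6                       # sample space for one roll of a 6-sided die
A <- c(2, 4, 6)                # even number
B <- c(1, 3, 5)                # odd number
C <- c(2, 3, 5)                # prime number

setdiff(S, C)                  # complement of C: 1 4 6
sort(union(A, C))              # A or C: 2 3 4 5 6
intersect(A, C)                # A and C: 2
length(intersect(A, B)) == 0   # TRUE: A and B are mutually exclusive
```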
library(ggplot2) # needed for ggplot() below
set.seed(1)
OBV <- 1:10
Dist1 <- NULL
Dist9 <- NULL
Dist16 <- NULL
Dist25 <- NULL
Dist36 <- NULL
count = 100
while(count > 0){Dist1 <- c(Dist1,sample(OBV, 1, replace = TRUE)); count <- count - 1}
count = 100
while(count > 0){Dist9 <- c(Dist9,mean(sample(OBV, 9,replace = TRUE) ) ); count <- count - 1}
count = 100
while(count > 0){Dist16 <- c(Dist16,mean(sample(OBV, 16,replace = TRUE) ) ); count <- count - 1}
count = 100
while(count > 0){Dist25 <- c(Dist25,mean(sample(OBV, 25,replace = TRUE) ) ); count <- count - 1}
count = 100
while(count > 0){Dist36 <- c(Dist36,mean(sample(OBV, 36,replace = TRUE) ) ); count <- count - 1}
Dist.df <- data.frame(Size = factor(rep(c(1,9,16,25,36), each=100)), Sample_Means = c(Dist1, Dist9, Dist16, Dist25, Dist36) )
ggplot(Dist.df, aes(Sample_Means, fill = Size)) + geom_histogram() + facet_grid(. ~ Size)
The likelihood of some event occurring…
The probability of an outcome is defined to be the proportion of times the outcome is observed under a large number of repetitions of the random process.
Assume that we are repeating the random process of a coin flip and are recording \(X\), the number of heads in \(n\) coin flips. Then:
\[ P(H) = \lim_{n\to\infty}\frac{X}{n} \]
\[ P(H) =\frac{1}{2} \]
set.seed(42) # for reproducibility
# Number of coin flips
n_flips <- 10000
# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
# Compute the cumulative proportion of heads
cumulative_mean <- cumsum(flips == "H") / (1:n_flips)
# Plot convergence
plot(1:n_flips, cumulative_mean, type = "l", col = "blue", lwd = 2,
xlab = "Number of Flips", ylab = "Proportion of Heads",
main = "Law of Large Numbers: Convergence to 1/2")
abline(h = 1/2, col = "red", lty = 2, lwd = 2) # Reference line at 1/2
The probability of an outcome is a degree of belief or reasonable expectation quantifying one’s state of knowledge based on observed events and prior knowledge.
set.seed(42) # For reproducibility
# Number of coin flips
n_flips <- 10000
# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
# Prior: Beta(1, 2) (weak prior belief about p = P(heads), prior mean 1/3)
alpha <- 1 # prior successes (heads)
beta <- 2 # prior failures (tails)
# Store posterior mean estimates
posterior_means <- numeric(n_flips)
# Bayesian updating
for (i in 1:n_flips) {
alpha <- alpha + (flips[i] == "H") # Increase count if the flip is heads
beta <- beta + (flips[i] != "H") # Increase count otherwise
posterior_means[i] <- alpha / (alpha + beta) # Compute posterior mean
}
# Plot posterior mean convergence
plot(1:n_flips, posterior_means, type = "l", col = "blue", lwd = 2,
xlab = "Number of Flips", ylab = "Posterior Mean of P(heads)",
main = "Bayesian Law of Large Numbers")
abline(h = 1/2, col = "red", lty = 2, lwd = 2) # True probability reference line
\[ 0 \leq P(A) \leq 1 \]
\[ P(S) = 1 \]
\[ P\left(\bigcup\limits_{i=1}^{\infty} A_{i}\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for mutually exclusive events } A_1, A_2, \dots \]
\[ P(A) + P(A^C) = 1 \]
If we know that the probability that somebody owns a bike is 0.08, then we also know that the probability that somebody does not own a bike is 0.92.
Two processes are independent if knowing about the outcome of one does not help predict the outcome of the other.
Example:
2 flips of a coin
If events \(A\) and \(B\) are from two independent processes:
\[ P(A \cap B) = P(A) \times P(B) \]
The probability of getting 2 heads in 2 flips:
\[ \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]
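The multiplication rule can be checked by simulation. The sketch below assumes 100,000 simulated pairs of fair coin flips (an arbitrary choice) and compares the observed proportion of double heads with the product of the two individual proportions; all variable names are illustrative.

```r
set.seed(42)                   # for reproducibility
n_trials <- 100000             # assumed number of simulated pairs of flips

# Two independent fair coin flips per trial
flip1 <- sample(c("H", "T"), size = n_trials, replace = TRUE)
flip2 <- sample(c("H", "T"), size = n_trials, replace = TRUE)

# Observed proportion of trials where both flips are heads
p_both_heads <- mean(flip1 == "H" & flip2 == "H")

# Product of the two individual proportions of heads
p_product <- mean(flip1 == "H") * mean(flip2 == "H")

p_both_heads                   # should be close to 1/4
p_product                      # should be close to p_both_heads
```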
If mutually exclusive \[ P(B\cup C) = P(B) + P(C) \]
If not mutually exclusive \[ P(B\cup C) = P(B) + P(C) - P(B \cap C) \]
Example: What is the probability of a heart or a king?
\[ P(H\cup K) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \]
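The heart-or-king calculation can be reproduced by enumerating a standard 52-card deck. This is a minimal sketch; the data frame `deck` and the rank and suit labels are assumptions made for illustration.

```r
# Build a standard 52-card deck (labels are illustrative)
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("hearts", "diamonds", "clubs", "spades")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

is_heart <- deck$suit == "hearts"
is_king <- deck$rank == "K"

p_heart <- mean(is_heart)            # 13 / 52
p_king <- mean(is_king)              # 4 / 52
p_both <- mean(is_heart & is_king)   # 1 / 52 (the king of hearts)

p_heart + p_king - p_both            # general addition rule: 16 / 52
mean(is_heart | is_king)             # direct count gives the same value
```

Subtracting `p_both` prevents the king of hearts from being counted twice.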
The General Social Survey (GSS) is a sociological survey that has been regularly conducted since 1972. It is a comprehensive survey that provides information on experiences of residents of the United States.
| College Science Course | Belief in Life After Death: Yes | Belief in Life After Death: No | Total |
|---|---|---|---|
| Yes | 375 | 75 | 450 |
| No | 485 | 115 | 600 |
| Total | 860 | 190 | 1050 |
Let \(B\) represent an event that a randomly selected person in this sample believes in life after death.
Let \(C\) represent an event that a randomly selected person in this sample took a college level science course.
A numeric outcome of a random process is called a random variable.
A joint probability describes the probability of 2 or more events (or random-variable outcomes) occurring together.
Note that events \(B\) and \(C\) are not mutually exclusive. A randomly selected person can believe in life after death and might have taken a college science course. \(B \cap C \neq \emptyset\)
\[ P(B \cap C) = \frac{375}{1050} \]
Note that \(P(B\cap C) = P(C\cap B)\). Order does not matter.
\(P(B)\) represents a marginal probability. So too do \(P(C)\), \(P(B^C)\), and \(P(C^C)\). To calculate these probabilities, we use only the values in the margins of the contingency table:
\[ P(B) = \frac{860}{1050} \]
\[ P(C) = \frac{450}{1050} \]
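The joint and marginal probabilities above can be computed directly from the counts. The sketch below stores the contingency table as a matrix; the object name `gss` and the dimension labels are assumptions introduced here.

```r
# Counts from the contingency table (rows: science course, columns: belief)
gss <- matrix(
  c(375, 75,
    485, 115),
  nrow = 2, byrow = TRUE,
  dimnames = list(Science = c("Yes", "No"), Belief = c("Yes", "No"))
)

n_total <- sum(gss)                  # 1050 people in the sample
joint <- gss / n_total               # joint probability of each cell

p_B_and_C <- joint["Yes", "Yes"]     # P(B and C) = 375 / 1050
p_B <- sum(joint[, "Yes"])           # P(B) = 860 / 1050 (column margin)
p_C <- sum(joint["Yes", ])           # P(C) = 450 / 1050 (row margin)

colSums(joint)                       # marginal probabilities for belief
rowSums(joint)                       # marginal probabilities for science course
```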
\(P(B|C)\) represents a conditional probability. So too do \(P(B^C|C)\), \(P(C|B)\), and \(P(C|B^C)\). To calculate these probabilities, we focus on the row or the column of the given information. We reduce the sample space to this given information.
Probability that a randomly selected person believes in life after death given that they have taken a college science course:
\[ P(B|C) = \frac{375}{450} \]
One is isolating the effects of a specific variable.
Order Matters!
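Reading "Order Matters!" as the statement that \(P(B|C)\) and \(P(C|B)\) are generally different, the sketch below computes both from the same counts. The matrix `gss` and its labels are assumptions introduced for illustration.

```r
# Counts from the contingency table (rows: science course, columns: belief)
gss <- matrix(
  c(375, 75,
    485, 115),
  nrow = 2, byrow = TRUE,
  dimnames = list(Science = c("Yes", "No"), Belief = c("Yes", "No"))
)

# P(B | C): restrict to the row of people who took a science course
p_B_given_C <- gss["Yes", "Yes"] / sum(gss["Yes", ])   # 375 / 450

# P(C | B): restrict to the column of people who believe in life after death
p_C_given_B <- gss["Yes", "Yes"] / sum(gss[, "Yes"])   # 375 / 860

p_B_given_C
p_C_given_B

# prop.table() with margin = 1 conditions on the rows (science course)
prop.table(gss, margin = 1)
```

Both probabilities share the numerator 375, but the denominator changes with the conditioning event, so the two values differ.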
Likelihood Function
\(P(Y|\theta)\) or \(\mathcal{L}(Y|\theta)\)
The probability of the observed data, given the hypothesis / parameters.
Does NOT sum to 1 (across hypotheses / parameter values)
Most scientific questions stem from the likelihood function.
Distinction: Downstream interpretations relate to how one constructs (and understands) a likelihood.
Option A: \(P(Data | \mathbf{Hypothesis})\) - Hypothesis fixed, data varies - Frequentist
Option B: \(P(\mathbf{Data} | Hypothesis)\) - Data fixed, hypothesis varies - Bayesian
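The two readings of \(P(Y|\theta)\) can be illustrated with a binomial model for coin flips. This is only a sketch: the data (7 heads in 10 flips), the grid of parameter values, and all variable names are hypothetical choices made for illustration.

```r
# Hypothetical data: y = 7 heads in n = 10 flips (assumed for illustration)
y <- 7
n <- 10

# Data fixed, theta varies: evaluate the likelihood over a grid of theta values
theta_grid <- seq(0, 1, by = 0.01)
likelihood <- dbinom(y, size = n, prob = theta_grid)
theta_grid[which.max(likelihood)]    # grid value with the highest likelihood (0.7 here)

# The area under the likelihood over theta is not 1
area_over_theta <- integrate(
  function(theta) dbinom(y, size = n, prob = theta),
  lower = 0, upper = 1
)$value
area_over_theta                      # about 0.0909 (1 / 11) for these values

# Theta fixed, data varies: probabilities over all possible outcomes sum to 1
total_over_data <- sum(dbinom(0:n, size = n, prob = 0.5))
total_over_data
```

With the hypothesis fixed, the values form a probability distribution over data; with the data fixed, the same expression is a function of the parameter and need not sum or integrate to 1.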